This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com. When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. The example below prints a statement and runs a quick computation.
#Quick R example
print('ASAS NANP Workshop Demo')
## [1] "ASAS NANP Workshop Demo"
#Quick computation
apple=5
orange=10
apple+orange
## [1] 15
The objectives of this hands-on training are to: 1) introduce workshop participants to methods for streamlining data-processing tasks in R; 2) demonstrate and provide examples of compiling large accelerometer datasets for determining daily livestock behavior; 3) introduce a suite of classification algorithms and validation-testing approaches for classifying accelerometer training datasets; and 4) utilize model predictions to estimate and analyze daily behavior for beef cattle. The dataset used in this tutorial is a subset of data from Brennan et al. (2021).
Our first step in processing accelerometer data is to import the libraries we will use to run our analysis. Each library contains a set of functions that can be used to process data. For example, the function mean() sums the values in a column and divides by the number of observations in the column. The code below checks whether the necessary packages are installed on your computer and, if not, installs them before they are loaded.
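For instance, you can try a built-in function like mean() directly in the console:

```r
#mean() sums the values in a vector and divides by the number of observations
mean(c(2, 4, 6, 8))
## [1] 5
```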
##If a package or dependency needs to be updated, un-comment the code below and replace 'vctrs' with the package name
#remove.packages('vctrs')
#install.packages('vctrs')
#Needed packages
list.of.packages <- c("lubridate","ggplot2","dplyr","randomForest","plotly","class","caret","MASS","knitr","markdown","rpart.plot")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(lubridate)
library(ggplot2)
library(dplyr)
library(randomForest)
library(plotly)
library(caret)
library(class)
library(MASS)
library(rpart)
library(e1071)
library(knitr)
library(markdown)
library(rpart.plot)
The example dataset was generated using a Gulf Coast X-16 mini accelerometer data logger sampling at 12 Hz and placed on a yearling steer in 2017. The data logger generates a file once 1,000,000 records have been recorded, which is roughly 1 day of data. Included in this tutorial are 5 raw datasets, or about 5 days of data. First we need to set the working directory to where the dataset is located. You can do this programmatically or by clicking Session -> Set Working Directory -> Choose Directory. We will load the dataset using the read.csv command with header set to FALSE and then view the data.
#Set working directory
#setwd("~/Conferences/NANP_2022/Workshop")
#Load in the raw data file and view first 15 records
Accel_df=read.csv('DATA-021.csv',header=F)
head(Accel_df,n=15L)
## V1 V2 V3
## 1 ;Title http://www.gcdataconcepts.com X16-mini
## 2 ;Version 1113 Build date
## 3 ;Start_time 2017-06-10 14:22:08.011
## 4 ;Temperature -999.00 deg C
## 5 ;SampleRate 12 Hz
## 6 ;Deadband 0 counts
## 7 ;DeadbandTimeout 0 sec
## 8 ;Time Ax Ay
## 9 0.045 789 -1404
## 10 0.128 1096 -1843
## 11 0.211 1060 -1116
## 12 0.294 719 -429
## 13 0.377 749 -720
## 14 0.460 1012 -1050
## 15 0.543 971 -1088
## V4 V5 V6
## 1 Analog Dev ADXL345
## 2 Jan 6 2016 SN:CCDC1016DEE0180
## 3
## 4 Vbat 4062 mv
## 5
## 6
## 7
## 8 Az
## 9 1451
## 10 1685
## 11 1576
## 12 1701
## 13 1587
## 14 1028
## 15 1190
As you can see from the output, the data come with a lot of unnecessary header information. Of interest to us are the time stamp and the X, Y, and Z data, which start on line 8. In addition, the start time is located on line 3 in columns V2 and V3. We will need to add our time column to this starting point to calculate the time stamp for each record.
#Extract start date and time and convert to date time object
start_time = paste(Accel_df[3,2],Accel_df[3,3])
start_time=as.POSIXct(start_time,format ="%Y-%m-%d %H:%M:%S")
#delete first 8 rows from dataframe
Accel_df=Accel_df[-c(1:8),]
#Use NULL to remove excess columns
Accel_df$V5=NULL
Accel_df$V6=NULL
#rename columns and convert to numeric
colnames(Accel_df)=c("Time","Ax","Ay","Az")
Accel_df$Time=as.numeric(Accel_df$Time)
Accel_df$Ax=as.numeric(Accel_df$Ax)
Accel_df$Ay=as.numeric(Accel_df$Ay)
Accel_df$Az=as.numeric(Accel_df$Az)
#add start time to time and display cleaned up header dataframe
Accel_df$Time= start_time+Accel_df$Time
rownames(Accel_df) <- NULL
head(Accel_df)
## Time Ax Ay Az
## 1 2017-06-10 14:22:08 789 -1404 1451
## 2 2017-06-10 14:22:08 1096 -1843 1685
## 3 2017-06-10 14:22:08 1060 -1116 1576
## 4 2017-06-10 14:22:08 719 -429 1701
## 5 2017-06-10 14:22:08 749 -720 1587
## 6 2017-06-10 14:22:08 1012 -1050 1028
Next we need to convert the data to G-forces (g’s), the unit of measure for accelerometers. Per the manufacturer’s user manual, the raw X, Y, and Z output needs to be divided by 2048 to convert to g’s. In addition, we want to calculate two other variables: movement intensity (MI) and signal magnitude area (SMA). Previous research has demonstrated that these combined variables are beneficial for classifying livestock behavior. The formulas are:
\[ MI = \sqrt { x^2 + y^2 + z^2} \] \[ SMA= \lvert x \rvert + \lvert y \rvert + \lvert z \rvert \]
#convert to g
Accel_df$Ax=Accel_df$Ax/2048
Accel_df$Ay=Accel_df$Ay/2048
Accel_df$Az=Accel_df$Az/2048
#Calculate MI and SMA
Accel_df$MI=sqrt(Accel_df$Ax^2 + Accel_df$Ay^2 + Accel_df$Az^2)
Accel_df$SMA=abs(Accel_df$Ax) + abs(Accel_df$Ay) + abs(Accel_df$Az)
head(Accel_df)
## Time Ax Ay Az MI SMA
## 1 2017-06-10 14:22:08 0.3852539 -0.6855469 0.7084961 1.0584714 1.779297
## 2 2017-06-10 14:22:08 0.5351562 -0.8999023 0.8227539 1.3315932 2.257812
## 3 2017-06-10 14:22:08 0.5175781 -0.5449219 0.7695312 1.0756418 1.832031
## 4 2017-06-10 14:22:08 0.3510742 -0.2094727 0.8305664 0.9257281 1.391113
## 5 2017-06-10 14:22:08 0.3657227 -0.3515625 0.7749023 0.9261873 1.492188
## 6 2017-06-10 14:22:08 0.4941406 -0.5126953 0.5019531 0.8711994 1.508789
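As a quick hand check, plugging the first raw record (Ax = 789, Ay = -1404, Az = 1451) into the conversion and the two formulas reproduces the MI and SMA values shown in row 1 above:

```r
#Convert the raw counts of record 1 to g's and recompute MI and SMA
ax <- 789/2048; ay <- -1404/2048; az <- 1451/2048
sqrt(ax^2 + ay^2 + az^2)      ## [1] 1.058471
abs(ax) + abs(ay) + abs(az)   ## [1] 1.779297
```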
Creating time windows can reduce noise from the raw accelerometer output and help with classification. To accomplish this, we will round the time up to the nearest 5-second interval; you can change the interval based on the desired level of aggregation (González et al., 2015). From this rounded time we will compute the mean, minimum, maximum, and standard error of our X, Y, Z, MI, and SMA variables for every 5-second window. We will then merge these calculated time-window measurements into a single data frame.
#round time to 5 second intervals
Accel_df$Time=lubridate::ceiling_date(Accel_df$Time,unit = "5 seconds")
#González, Luciano A., et al. "Behavioral classification of data from collars containing motion sensors in grazing cattle." Computers and Electronics in Agriculture 110 (2015): 91-102.
#Create Standard Error Function
standard_error <- function(x) sd(x) / sqrt(length(x))
#Calculate Mean, Max, Min, and SE for each 5 second time stamp
Accel_mean=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=mean)
colnames(Accel_mean)=c("Time","X_Mean","Y_Mean","Z_Mean", "MI_Mean","SMA_Mean")
Accel_min=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=min)
colnames(Accel_min)=c("Time","X_Min","Y_Min","Z_Min","MI_Min","SMA_Min")
Accel_max=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=max)
colnames(Accel_max)=c("Time","X_Max","Y_Max","Z_Max","MI_Max","SMA_Max")
Accel_SE=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=standard_error )
colnames(Accel_SE)=c("Time","X_SE","Y_SE","Z_SE","MI_SE","SMA_SE")
#Combine into one dataframe
Accel_df=list(Accel_mean,Accel_max,Accel_min,Accel_SE)
Accel_df=Reduce(function(x, y) merge(x, y, all=TRUE), Accel_df)
head(Accel_df)
## Time X_Mean Y_Mean Z_Mean MI_Mean SMA_Mean
## 1 2017-06-10 14:22:10 0.4301961 -0.5261841 0.7205200 1.009853 1.676900
## 2 2017-06-10 14:22:15 0.4179036 -0.5146240 0.7266846 1.002592 1.659212
## 3 2017-06-10 14:22:20 0.3803467 -0.5254639 0.7467855 1.004776 1.652596
## 4 2017-06-10 14:22:25 0.4224513 -0.5035781 0.7137071 0.986538 1.639736
## 5 2017-06-10 14:22:30 0.3697428 -0.5133952 0.7609701 1.002796 1.644108
## 6 2017-06-10 14:22:35 0.4598307 -0.4775228 0.7189290 1.002290 1.656283
## X_Max Y_Max Z_Max MI_Max SMA_Max X_Min Y_Min
## 1 0.7519531 -0.2094727 1.0322266 1.335298 2.282715 0.2153320 -0.8999023
## 2 0.7592773 -0.1806641 1.0336914 1.364981 2.311523 0.1557617 -0.8886719
## 3 0.6918945 -0.1572266 1.1235352 1.414516 2.330078 0.2045898 -0.9643555
## 4 0.6606445 -0.1582031 0.9902344 1.305036 2.239258 0.2246094 -0.8476562
## 5 0.5810547 -0.2685547 1.1215820 1.381742 2.234375 0.2109375 -0.7763672
## 6 0.8901367 -0.1738281 1.0405273 1.373862 2.300293 0.2275391 -0.9233398
## Z_Min MI_Min SMA_Min X_SE Y_SE Z_SE MI_SE
## 1 0.4555664 0.8391981 1.348633 0.02507613 0.03454523 0.03050527 0.03270258
## 2 0.4599609 0.8359342 1.314941 0.01742707 0.01874789 0.01774744 0.01822701
## 3 0.4584961 0.7883861 1.252441 0.01233284 0.01987365 0.01925812 0.01976474
## 4 0.4262695 0.7809516 1.207520 0.01507468 0.01756646 0.01633678 0.01650232
## 5 0.3808594 0.6734779 1.093262 0.01211075 0.01572578 0.02093040 0.01967968
## 6 0.3720703 0.8271793 1.252930 0.01891206 0.02135721 0.01849351 0.01838327
## SMA_SE
## 1 0.05946504
## 2 0.03348271
## 3 0.03364648
## 4 0.03047485
## 5 0.03114984
## 6 0.03254150
Now that we have aggregated our data into 5-second intervals, we have reduced the data size from 1,000,000 records to 16,457 records, which is much more manageable. We can also start to plot the data to see if there are any obvious activity patterns throughout the day.
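The record count is easy to sanity-check with a little arithmetic: 1,000,000 samples at 12 Hz span roughly 83,333 seconds, or about 16,667 five-second windows, close to the 16,457 windows obtained (partial and missing windows likely account for the difference).

```r
#Expected number of 5 second windows in one 1,000,000 record file
records <- 1e6; hz <- 12; window_sec <- 5
(records / hz) / window_sec
## [1] 16666.67
```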
ggplot2::ggplot(Accel_df,aes(x=Time,y=MI_Mean))+
geom_line(color='Red')+
ylab("MI Mean")+
ggtitle('Movement Intensity Five Second Mean')
ggplot2::ggplot(Accel_df)+
geom_line(aes(x=Time,y=X_Max),color='red')+
geom_line(aes(x=Time,y=Y_Max),color='blue')+
geom_line(aes(x=Time,y=Z_Max),color='black')+
ylab("g")+
ggtitle('X, Y, and Z Maximum Values')
Numerous steps were required to process the dataset into a usable format that can be incorporated into machine learning models. In R, we can use existing functions to help process and analyze our data, but we can also write our own functions. This can help simplify code and streamline data processing. Below we wrap all of the previous data-processing steps into a single custom function. The function is named Accel_function and takes a file path (datafile) as its input.
Accel_function=function (datafile){
#Load in the raw data file and view first 15 records
Accel_df=read.csv(datafile,header=F)
start_time = paste(Accel_df[3,2],Accel_df[3,3])
start_time=as.POSIXct(start_time,format ="%Y-%m-%d %H:%M:%S")
#delete first 8 rows from dataframe and remove unneeded blank rows
Accel_df=Accel_df[-c(1:8),]
Accel_df$V5=NULL
Accel_df$V6=NULL
#rename columns and convert to numeric
colnames(Accel_df)=c("Time","Ax","Ay","Az")
Accel_df$Time=as.numeric(as.character(Accel_df$Time))
Accel_df$Ax=as.numeric(as.character(Accel_df$Ax))
Accel_df$Ay=as.numeric(as.character(Accel_df$Ay))
Accel_df$Az=as.numeric(as.character(Accel_df$Az))
#add start time to time and display cleaned up header dataframe
Accel_df$Time= start_time+Accel_df$Time
rownames(Accel_df) <- NULL
#convert to g
Accel_df$Ax=Accel_df$Ax/2048
Accel_df$Ay=Accel_df$Ay/2048
Accel_df$Az=Accel_df$Az/2048
#Calculate MI and SMA
Accel_df$MI=sqrt(Accel_df$Ax^2 + Accel_df$Ay^2 + Accel_df$Az^2)
Accel_df$SMA=abs(Accel_df$Ax) + abs(Accel_df$Ay) + abs(Accel_df$Az)
#round time to 5 second intervals
Accel_df$Time=lubridate::ceiling_date(Accel_df$Time,unit = "5 seconds")
standard_error <- function(x) sd(x) / sqrt(length(x))
#Calculate Mean, Max, Min, and SE for each 5 second time stamp
Accel_mean=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=mean)
colnames(Accel_mean)=c("Time","X_Mean","Y_Mean","Z_Mean", "MI_Mean","SMA_Mean")
Accel_min=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=min)
colnames(Accel_min)=c("Time","X_Min","Y_Min","Z_Min","MI_Min","SMA_Min")
Accel_max=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=max)
colnames(Accel_max)=c("Time","X_Max","Y_Max","Z_Max","MI_Max","SMA_Max")
Accel_SE=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=standard_error )
colnames(Accel_SE)=c("Time","X_SE","Y_SE","Z_SE","MI_SE","SMA_SE")
#Combine into one dataframe
Accel_df=list(Accel_mean,Accel_max,Accel_min,Accel_SE)
Accel_df=Reduce(function(x, y) merge(x, y, all=TRUE), Accel_df)
return(Accel_df)
}
We can now call our function and process a datafile with a single line of code and display the output.
Accel_data=Accel_function('DATA-022.csv')
head(Accel_data)
## Time X_Mean Y_Mean Z_Mean MI_Mean SMA_Mean
## 1 2017-06-11 13:15:20 0.3492635 -0.4703878 0.7804871 0.9853950 1.600138
## 2 2017-06-11 13:15:25 0.5361654 -0.3397054 0.7276042 0.9803144 1.603475
## 3 2017-06-11 13:15:30 0.6746971 -0.2567697 0.6084895 1.0119527 1.580492
## 4 2017-06-11 13:15:35 0.5690267 -0.4343913 0.6648600 1.0152054 1.668278
## 5 2017-06-11 13:15:40 0.5214844 -0.4165853 0.7055583 0.9847082 1.643628
## 6 2017-06-11 13:15:45 0.4854818 -0.4696533 0.7023600 0.9894400 1.657495
## X_Max Y_Max Z_Max MI_Max SMA_Max X_Min Y_Min
## 1 0.6884766 -0.24951172 0.9526367 1.240585 2.074707 0.200195312 -0.7568359
## 2 0.7416992 -0.15820312 0.9106445 1.100161 1.818359 0.115722656 -0.8256836
## 3 1.2934570 0.32861328 1.4755859 1.889156 3.264160 0.003417969 -1.0454102
## 4 1.3168945 -0.02441406 1.0747070 1.648346 2.719727 0.003906250 -0.9428711
## 5 0.7270508 -0.21191406 1.1723633 1.423899 2.313477 0.182617188 -0.7153320
## 6 0.9204102 -0.13183594 0.8925781 1.177397 1.958008 0.325683594 -0.8354492
## Z_Min MI_Min SMA_Min X_SE Y_SE Z_SE MI_SE
## 1 0.60937500 0.8494963 1.2895508 0.01395528 0.01529155 0.010543921 0.012006165
## 2 0.58984375 0.9056680 1.4438477 0.01577842 0.01438799 0.007520017 0.005047960
## 3 -0.03173828 0.2503326 0.3183594 0.03613918 0.03193671 0.039869168 0.040356071
## 4 0.35302734 0.5414528 0.7905273 0.02854079 0.02475697 0.019483584 0.022737444
## 5 0.47998047 0.6477652 1.0717773 0.01509846 0.01312463 0.014251036 0.012519064
## 6 0.56201172 0.8579398 1.4350586 0.01626596 0.01469610 0.008986108 0.007921131
## SMA_SE
## 1 0.022556001
## 2 0.009157867
## 3 0.067388964
## 4 0.038089568
## 5 0.020227060
## 6 0.012894330
In this example, there are 5 data files from one animal that need to be processed and merged together. You could run the function five separate times and merge the files together. However, if this were the full accelerometer dataset, with 90+ files for each individual animal, this would become a labor-intensive process. To further automate the process, we will extract the names of all the files in the directory that match a pattern, in this case ‘DATA-’, and apply our function to each of these files. We will then bind the rows together to make one dataset from all the files.
#extract the names of all files that match the string 'DATA-'
filenames=list.files(getwd(),pattern = "DATA-",all.files = FALSE)
#Process the list of files to create a data frame with all five data files merged together
Accel_Merged <- dplyr::bind_rows(lapply(filenames, Accel_function))
#Print Dimensions of Data
dim(Accel_Merged)
## [1] 82281 21
Using the dim (dimensions) function, we see that we now have a dataframe with 82,281 rows and 21 columns. Though it took a few minutes to run, we were able to process 5 million raw accelerometer records split across 5 files with just a few lines of code. The plot below shows the signal magnitude area standard error calculated every five seconds over the five-day period.
ggplot2::ggplot(Accel_Merged,aes(x=Time,y=SMA_SE))+
geom_line(color='Red')+
ylab("SMA Standard Error")+
ggtitle('Signal Magnitude Area Standard Error (5 Second Intervals)')
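One optional variation, not used in this tutorial: if you name the list of data frames, dplyr::bind_rows can record which file each row came from via its .id argument. A toy illustration with made-up data:

```r
library(dplyr)
#Hypothetical example: a named list lets bind_rows tag each row's source
dfs <- list("DATA-021" = data.frame(x = 1:2), "DATA-022" = data.frame(x = 3:4))
bind_rows(dfs, .id = "source_file")
```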
Now that we have demonstrated how to process the raw accelerometer files, we will move on to the training dataset for our machine learning models. This dataset consists of 11,478 labeled 5-second accelerometer observations. Visual observations of individual animals in the field were used to label the data. First, we will load the training dataset, look at the column names, and count the observations for each behavior using the ‘table’ function.
observed_df=read.csv('Model_Training_Data.csv')
#Set Behavior as factor
observed_df$Behavior=as.factor(observed_df$Behavior)
#print column names
colnames(observed_df)
## [1] "Time" "X_Mean" "Y_Mean" "Z_Mean" "MI_Mean" "SMA_Mean"
## [7] "X_Max" "Y_Max" "Z_Max" "MI_Max" "SMA_Max" "X_Min"
## [13] "Y_Min" "Z_Min" "MI_Min" "SMA_Min" "MI_SE" "SMA_SE"
## [19] "X_SE" "Y_SE" "Z_SE" "Behavior"
#print count of each behavior
table(observed_df$Behavior)
##
## G R W
## 6738 4520 220
From our output, we can see that the dataset contains the same variables as our processed dataset. In addition, there is another column labeled ‘Behavior’. The output from the ‘table’ function shows that there are 6,738 G (grazing) observations, 4,520 R (resting) observations, and 220 W (walking) observations.
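Converting these counts to proportions makes the class imbalance explicit; walking makes up less than 2% of the labeled observations, which is worth keeping in mind when interpreting per-class accuracy.

```r
#Behavior counts from the table above, expressed as proportions
counts <- c(G = 6738, R = 4520, W = 220)
round(prop.table(counts), 3)   #G 0.587, R 0.394, W 0.019
```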
One of the first steps in analyzing data is to visualize it with a number of different plots. This can help us understand data distributions, challenges that may exist in the modeling process, and sources of error.
ggplot(observed_df, aes(x=X_Mean,color=Behavior)) +
geom_histogram()+
facet_wrap(~Behavior)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram, we can see that grazing observations decline sharply between 0 and 0.25 while resting observations increase over the same range. There is some overlap between the behaviors that might be a source of model error.
ggplot(observed_df, aes(x=MI_SE,color=Behavior)) +
geom_density(lwd=1.5)
The density plot shows a shift to the right (higher standard error measurements) for the grazing and walking behaviors relative to the resting behavior.
ggplot(observed_df,aes(x=MI_SE,y=X_SE,color=Behavior))+
geom_point()+
xlim(0,0.05)+
xlab("MI Standard Error")+
ylab("X Standard Error")
## Warning: Removed 25 rows containing missing values (geom_point).
From our scatter plot we can see good separation of the grazing and resting behaviors. We can even view the plots in 3D to help visualize the separation of behaviors in space.
plotly::plot_ly(observed_df, x=~SMA_SE, y=~X_Mean,
z=~MI_Min, color=~Behavior)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
For this example we are going to use two different validation approaches for assessing our model accuracy.
The validation set approach (VSA) divides the observed dataset into training and testing datasets. The training dataset is used to fit the different models, which are then used to predict behavior on the test dataset to get an unbiased estimate of model accuracy.
The second approach is cross validation (CV). CV splits the data into a number of different groups (folds). The model is fit iteratively, holding out one fold for testing each time, and the average error rate across all folds is reported. For many machine learning applications, 5- or 10-fold cross-validation is commonly used; for our example, we will use 5. As we increase the number of folds, we increase the computational time needed to run the analysis.
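The fold assignment can be sketched in a few lines (a toy example with a hypothetical 100-observation dataset, not the workshop data):

```r
set.seed(314)
n <- 100                                   #hypothetical number of observations
folds <- sample(rep(1:5, length.out = n))  #randomly assign each row to a fold
table(folds)                               #exactly 20 observations per fold
```

Each fold then serves once as the held-out test set while the remaining four folds form the training set.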
Before we split our dataset, we want to use the ‘set.seed’ function. The split into train and test sets is random, but by setting the seed we can reproduce our results because the same random sample is drawn each time. In the code below, we will split our dataset into an 80/20 train/test split.
#setting seed allows us to reproduce the exact results
set.seed(314)
#This example will do a 80/20 train/test. You can change the 0.8 to alter this ratio
train_data_index <- sample(1:nrow(observed_df), 0.8 * nrow(observed_df))
test_data_index <- setdiff(1:nrow(observed_df), train_data_index)
# Build train and test datasets
train_data <- observed_df[train_data_index,]
test_data <- observed_df[test_data_index, ]
The first method we will test is k-nearest neighbors (KNN). The code below fits a KNN model, applies it to the test dataset, and compares the observed versus predicted values to assess model accuracy.
#KNN VSA method
#create train dataset with only predictors, columns 2-21 are our predictor variables (from the accelerometer)
train.knn=train_data[,2:21]
#test dataset only predictors
test.knn=test_data[,2:21]
#fit KNN model with three nearest neighbors; assign predictions to test_data column KNN
test_data$KNN=knn(train.knn,test.knn,train_data$Behavior,k=3)
#compare prediction with observed on test dataset
caret::confusionMatrix(test_data$KNN,test_data$Behavior)
## Confusion Matrix and Statistics
##
## Reference
## Prediction G R W
## G 1277 81 19
## R 65 809 6
## W 14 3 22
##
## Overall Statistics
##
## Accuracy : 0.9181
## 95% CI : (0.9061, 0.929)
## No Information Rate : 0.5906
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.835
##
## Mcnemar's Test P-Value : 0.3193
##
## Statistics by Class:
##
## Class: G Class: R Class: W
## Sensitivity 0.9417 0.9059 0.468085
## Specificity 0.8936 0.9494 0.992441
## Pos Pred Value 0.9274 0.9193 0.564103
## Neg Pred Value 0.9140 0.9407 0.988923
## Prevalence 0.5906 0.3889 0.020470
## Detection Rate 0.5562 0.3524 0.009582
## Detection Prevalence 0.5997 0.3833 0.016986
## Balanced Accuracy 0.9177 0.9277 0.730263
#store accuracy for table later on
knn.vsa=as.numeric(caret::confusionMatrix(test_data$KNN,test_data$Behavior)$overall[1])
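As a sanity check on the caret output, the overall accuracy is simply the diagonal of the confusion matrix (the correct predictions) divided by the total number of test observations:

```r
#Recompute overall accuracy from the confusion matrix counts above
cm <- matrix(c(1277,  81, 19,
                 65, 809,  6,
                 14,   3, 22), nrow = 3, byrow = TRUE)
sum(diag(cm)) / sum(cm)   #0.9181, matching the reported Accuracy
```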
Next we will fit a KNN model using the cross-validation approach. Note that this approach uses the full observed dataset. For each iteration, a KNN model is fit on 80% of the data and tested on the remaining 20%. The reported accuracy is the average of those 5 iterations.
#KNN 5 fold Cross validation
set.seed(314)
#Train model using 5 fold cv
knn.cv=train(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
tuneGrid=expand.grid(k=3),
method='knn',
trControl=trainControl(method = "cv",number=5), #change number to change number of folds
metric="Accuracy",
data = observed_df)
knn.cv
## k-Nearest Neighbors
##
## 11478 samples
## 20 predictor
## 3 classes: 'G', 'R', 'W'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9182, 9182, 9183, 9183, 9182
## Resampling results:
##
## Accuracy Kappa
## 0.9214148 0.8409239
##
## Tuning parameter 'k' was held constant at a value of 3
#store accuracy for later
knn.cv=knn.cv$results$Accuracy
The next method we will test is linear discriminant analysis (LDA).
#####LDA VSA Approach
lda.vsa=lda(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data=train_data)
#predict behavior on test dataset using the model
test_data$LDA_VSA=predict(lda.vsa,test_data,type="response")$class
caret::confusionMatrix(test_data$LDA_VSA,test_data$Behavior)
## Confusion Matrix and Statistics
##
## Reference
## Prediction G R W
## G 1289 85 36
## R 64 808 10
## W 3 0 1
##
## Overall Statistics
##
## Accuracy : 0.9138
## 95% CI : (0.9015, 0.9249)
## No Information Rate : 0.5906
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8232
##
## Mcnemar's Test P-Value : 6.924e-09
##
## Statistics by Class:
##
## Class: G Class: R Class: W
## Sensitivity 0.9506 0.9048 0.0212766
## Specificity 0.8713 0.9473 0.9986661
## Pos Pred Value 0.9142 0.9161 0.2500000
## Neg Pred Value 0.9244 0.9399 0.9799302
## Prevalence 0.5906 0.3889 0.0204704
## Detection Rate 0.5614 0.3519 0.0004355
## Detection Prevalence 0.6141 0.3841 0.0017422
## Balanced Accuracy 0.9109 0.9260 0.5099713
#store model accuracy
LDA_VSA=as.numeric(caret::confusionMatrix(test_data$LDA_VSA,test_data$Behavior)$overall[1])
We can also generate stacked histograms of our predictions.
ldahist(test_data$X_Mean,test_data$LDA_VSA)
To run the CV on LDA we will write our own function.
#We need to create a cross validation function. This will loop through each k fold, fit the model, and store and average the error rates
cv.lda <-function (data, model=origin~., yname="origin", K=5, seed=314) {
n <- nrow(data)
set.seed(seed)
datay=data[,yname] #response variable
#partition the data into K subsets
f <- ceiling(n/K)
s <- sample(rep(1:K, f), n)
#generate fold labels 1:K and sample n of them
# K fold cross-validated error
CV=NULL
for (i in 1:K) {
test.index <- seq_len(n)[(s == i)] #test data
train.index <- seq_len(n)[(s != i)] #training data
#model with training data
lda.fit=lda(model, data=data[train.index,])
#observed test set y
lda.y <- data[test.index, yname]
#predicted test set y
lda.predy=predict(lda.fit, data[test.index,])$class
#observed - predicted on test data
error= mean(lda.y!=lda.predy)
#error rates
CV=c(CV,error)
}
#Output
list(call = model, K = K,error=CV,
lda_error_rate = mean(CV), seed = seed)
}
#Use our function to run the CV
lda.kfold=cv.lda(data=observed_df,
model = Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
yname="Behavior",
K=5,
seed = 314)
#Show output and store accuracy
lda.kfold
## $call
## Behavior ~ X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max +
## Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min +
## MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
##
## $K
## [1] 5
##
## $error
## [1] 0.08972125 0.08667247 0.09277003 0.07360627 0.08761988
##
## $lda_error_rate
## [1] 0.08607798
##
## $seed
## [1] 314
lda.cv=1-lda.kfold$lda_error_rate
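Note that the reported lda_error_rate is just the mean of the five per-fold errors, and one minus that value is the accuracy used in the summary table:

```r
#Per-fold LDA error rates from the output above
errs <- c(0.08972125, 0.08667247, 0.09277003, 0.07360627, 0.08761988)
mean(errs)       ## [1] 0.08607798
1 - mean(errs)   ## [1] 0.913922
```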
Random forest is a machine learning method built on recursive partitioning (decision trees), which breaks the data down into a series of splits that result in a prediction. A simple example from our dataset can be seen below.
#Example of recursive partitioning
rpart_vsa=rpart(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data=observed_df,control = rpart.control(minsplit = 10)) #minimum of 10 observations in a node before attempting a split
#Plot the decision tree
rpart.plot::rpart.plot(rpart_vsa)
Though these single-tree models are useful for interpretation, more complex tree-based models often perform better. Random forest uses a similar process, except that it fits a large number of trees, each on a bootstrapped sample of the data and with a random subset of features.
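The bootstrapping idea can be seen in isolation with a toy example: sampling row indices with replacement repeats some rows and leaves others out entirely (the left-out rows are called "out-of-bag").

```r
#Toy bootstrap sample: draw 10 row indices with replacement
set.seed(314)
idx <- sample(1:10, size = 10, replace = TRUE)
sort(idx)           #some indices repeat, others never appear
setdiff(1:10, idx)  #the out-of-bag rows for this hypothetical tree
```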
#Set Seed
set.seed(314)
#Fit Random Forest Model
rf_vsa=randomForest( Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
,ntree=1000,data=train_data)
#Predict Behavior on test dataset
test_data$rf_vsa=predict(rf_vsa,newdata=test_data)
#Generate confusion matrix and save accuracy
caret::confusionMatrix(test_data$rf_vsa,test_data$Behavior)
## Confusion Matrix and Statistics
##
## Reference
## Prediction G R W
## G 1301 41 24
## R 54 852 9
## W 1 0 14
##
## Overall Statistics
##
## Accuracy : 0.9438
## 95% CI : (0.9336, 0.9529)
## No Information Rate : 0.5906
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8861
##
## Mcnemar's Test P-Value : 5.391e-07
##
## Statistics by Class:
##
## Class: G Class: R Class: W
## Sensitivity 0.9594 0.9541 0.297872
## Specificity 0.9309 0.9551 0.999555
## Pos Pred Value 0.9524 0.9311 0.933333
## Neg Pred Value 0.9409 0.9703 0.985533
## Prevalence 0.5906 0.3889 0.020470
## Detection Rate 0.5666 0.3711 0.006098
## Detection Prevalence 0.5949 0.3985 0.006533
## Balanced Accuracy 0.9451 0.9546 0.648714
vsa_rf=as.numeric(caret::confusionMatrix(test_data$rf_vsa,test_data$Behavior)$overall[1])
We can also look at which variables are most important in the model.
varImpPlot(rf_vsa,main="Variable Importance Plot RF Model")
Again, for the RF model we will use our own cross validation function to test model accuracy.
rf.cv=function (data, model=origin~., yname="origin", K=5, seed=314) {
n <- nrow(data)
set.seed(seed)
datay=data[,yname] #response variable
#partition the data into K subsets
f <- ceiling(n/K)
s <- sample(rep(1:K, f), n)
#generate fold labels 1:K and sample n of them
# K fold cross-validated error
CV=NULL
for (i in 1:K) {
test.index <- seq_len(n)[(s == i)] #test data
train.index <- seq_len(n)[(s != i)] #training data
#model with training data
rf.fit=randomForest(model, data=data[train.index,])
#observed test set y
rf.y <- data[test.index, yname]
#predicted test set y
rf.predy=predict(rf.fit, data[test.index,])
#observed - predicted on test data
error= mean(rf.y!=rf.predy)
#error rates
CV=c(CV,error)
}
#Output
list(call = model, K = K,error=CV,
rf_error_rate = mean(CV), seed = seed)
}
#Run function for cross validation using random forest
cv_rf=rf.cv(data = observed_df,
model = Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
yname="Behavior",
K=5,
seed = 314)
#Cross validation output
cv_rf
## $call
## Behavior ~ X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max +
## Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min +
## MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
##
## $K
## [1] 5
##
## $error
## [1] 0.05923345 0.05792683 0.05487805 0.04355401 0.05100262
##
## $rf_error_rate
## [1] 0.05331899
##
## $seed
## [1] 314
#store accuracy
cv_rf=1-cv_rf$rf_error_rate
The next model we will fit is a support vector machine (SVM), using the validation set approach.
#Fit SVM model
svm_mod=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data = train_data,kernel='linear')
#Use model to predict on test dataset
test_data$SVM_VSA=predict(svm_mod,test_data)
#Calculate confusion matrix and save accuracy
svm_vsa=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_VSA)$overall[1])
caret::confusionMatrix(test_data$Behavior,test_data$SVM_VSA)
## Confusion Matrix and Statistics
##
## Reference
## Prediction G R W
## G 1285 71 0
## R 71 822 0
## W 39 8 0
##
## Overall Statistics
##
## Accuracy : 0.9177
## 95% CI : (0.9057, 0.9286)
## No Information Rate : 0.6076
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8315
##
## Mcnemar's Test P-Value : 3.476e-10
##
## Statistics by Class:
##
## Class: G Class: R Class: W
## Sensitivity 0.9211 0.9123 NA
## Specificity 0.9212 0.9491 0.97953
## Pos Pred Value 0.9476 0.9205 NA
## Neg Pred Value 0.8830 0.9437 NA
## Prevalence 0.6076 0.3924 0.00000
## Detection Rate 0.5597 0.3580 0.00000
## Detection Prevalence 0.5906 0.3889 0.02047
## Balanced Accuracy 0.9212 0.9307 NA
SVM using 5-fold CV.
svm.cv=function (data, model=origin~., yname="origin", K=5, seed=314) {
n <- nrow(data)
set.seed(seed)
datay=data[,yname] #response variable
#partition the data into K subsets
f <- ceiling(n/K)
s <- sample(rep(1:K, f), n)
#generate fold labels 1:K and sample n of them
# K fold cross-validated error
CV=NULL
for (i in 1:K) {
test.index <- seq_len(n)[(s == i)] #test data
train.index <- seq_len(n)[(s != i)] #training data
#model with training data
svm.fit=svm(model, data=data[train.index,],kernel='linear')
#observed test set y
svm.y <- data[test.index, yname]
#predicted test set y
svm.predy=predict(svm.fit, data[test.index,])
#observed - predicted on test data
error= mean(svm.y!=svm.predy)
#error rates
CV=c(CV,error)
}
#Output
list(call = model, K = K,error=CV,
svm_error_rate = mean(CV), seed = seed)
}
#Run function for cross validation using the SVM
cv_svm=svm.cv(data = observed_df,
model = Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
yname="Behavior",
K=5,
seed = 314)
#store accuracy
cv_svm=1-cv_svm$svm_error_rate
#Cross validation output
cv_svm
## [1] 0.9174071
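The fold assignment inside svm.cv can be examined in isolation. The sketch below, using a hypothetical n of 11 rows, shows how sample(rep(1:K, f), n) shuffles roughly equal-sized fold labels across the observations:

```r
set.seed(314)
n <- 11            #hypothetical number of observations
K <- 5             #number of folds
f <- ceiling(n/K)  #repeats needed so the labels cover all n rows
s <- sample(rep(1:K, f), n) #shuffled fold label for each row
table(s)           #fold sizes are approximately equal
```

Each observation ends up in the test set exactly once across the K iterations, which is what makes the averaged fold errors a cross-validated estimate.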
Now that we have run our selected machine learning models, we can summarize the accuracy of each model and testing scheme in a table. Note that although we ran four models above, there are many more that could be fit and compared for accuracy.
#create dataframe of accuracy and models
final_table=as.data.frame(rbind(c(knn.vsa*100,knn.cv*100),c(LDA_VSA*100,lda.cv*100),c(vsa_rf*100,cv_rf*100),c(svm_vsa*100,cv_svm*100)))
final_table$Model=c("KNN","LDA","RF","SVM")
colnames(final_table)=c("VSA Accuracy","CV Accuracy","Model")
final_table<- final_table[, c(3,1,2)]
rownames(final_table)=NULL
knitr::kable(final_table,digits=1,caption="Model Accuracy (%) for Validation and CV Approaches")
| Model | VSA Accuracy | CV Accuracy |
|---|---|---|
| KNN | 91.8 | 92.1 |
| LDA | 91.4 | 91.4 |
| RF | 94.4 | 94.7 |
| SVM | 91.8 | 91.7 |
For the training examples above, we ran each of our models with the default settings and evaluated their accuracy. For most machine learning models, there are additional parameters that can be adjusted that may improve model accuracy. For example, for KNN we used 3 neighbors to make predictions, but we could also evaluate which number of neighbors provides the optimal solution. In the example below, we will use a grid search to test the cross-validation accuracy of 1, 3, 5, and 7 neighbors.
#KNN 5-fold cross validation
#library(caret) #already loaded above
set.seed(314)
#Train model using 10 fold cv
knn.cv=train(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
tuneGrid=expand.grid(k=c(1,3,5,7)),
method='knn',
trControl=trainControl(method = "cv",number=5), #change number to change number of folds
metric="Accuracy",
data = observed_df)
knn.cv
## k-Nearest Neighbors
##
## 11478 samples
## 20 predictor
## 3 classes: 'G', 'R', 'W'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9182, 9182, 9183, 9183, 9182
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9073881 0.8143589
## 3 0.9211532 0.8403522
## 5 0.9228961 0.8432601
## 7 0.9229832 0.8430419
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
#store the best accuracy for later
knn.cv=max(knn.cv$results$Accuracy)
We can see from the output that the model with k=7 neighbors provided the best results.
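caret's "largest value" selection can be reproduced by hand. A toy sketch, using the accuracy column copied from the tuning output above:

```r
#tuning results copied from the caret output above
res <- data.frame(k = c(1, 3, 5, 7),
                  Accuracy = c(0.9073881, 0.9211532, 0.9228961, 0.9229832))
best_k <- res$k[which.max(res$Accuracy)]
best_k #7, matching caret's selection
```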
Other models, such as support vector machines, have similar parameters that can be tuned. For example, below we use cross validation to tune the cost parameter and evaluate model accuracy.
set.seed(314)
tune_svm=tune(svm,Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data = observed_df,kernel='linear',ranges = list(cost=c(.01,10)))
summary(tune_svm)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.08250526
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.08442149 0.006576775
## 2 10.00 0.08250526 0.006755321
We can also adjust the kernel type to see if a linear, polynomial, or radial kernel provides better results.
#Fit SVM model
svm_lin=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data = train_data,kernel='linear')
#Use model to predict on test dataset
test_data$SVM_lin=predict(svm_lin,test_data)
#Calculate confusion matrix and save accuracy
svm_lin=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_lin)$overall[1])
#Fit Radial Model
svm_rad=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data = train_data,kernel='radial')
#Use model to predict on test dataset
test_data$SVM_rad=predict(svm_rad,test_data)
#Calculate confusion matrix and save accuracy
svm_rad=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_rad)$overall[1])
#Fit Polynomial Model
svm_poly=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
data = train_data,kernel='polynomial')
#Use model to predict on test dataset
test_data$SVM_poly=predict(svm_poly,test_data)
#Calculate confusion matrix and save accuracy
svm_poly=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_poly)$overall[1])
#Print accuracy of each
print(paste('Linear:',svm_lin,',', "Radial:",svm_rad,',', "Polynomial:",svm_poly))
## [1] "Linear: 0.917682926829268 , Radial: 0.937282229965157 , Polynomial: 0.930749128919861"
We can see from the output that the radial kernel had the highest accuracy. By adjusting the parameters of the model we may further enhance our predictive capabilities. For now we will continue with our random forest model, which had the highest overall accuracy.
Based on the table above, the random forest model had the highest accuracy of all the models, so we will select the RF model for predicting all of our unobserved accelerometer data. To do so, we now fit an RF model as above, but using all of the data available to us. Because all available data are used to fit the model, we cannot obtain an unbiased accuracy estimate from a held-out testing dataset. The error estimate reported below is the out-of-bag (OOB) error rate, which random forests compute internally from the trees that did not see each observation. It is reassuring that this error is close to our testing error from earlier.
set.seed(314)
rf_deploy=randomForest( Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max +
X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
,ntree=1000,data=observed_df)
rf_deploy
##
## Call:
## randomForest(formula = Behavior ~ X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE, data = observed_df, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.07%
## Confusion matrix:
## G R W class.error
## G 6535 195 8 0.03012763
## R 218 4302 0 0.04823009
## W 121 40 59 0.73181818
Now that we have fit our final model, we want to use it to predict behavior on all unobserved accelerometer data. To do this we will use the ‘predict’ function with the inputs of our final model (rf_deploy) and the accelerometer dataset we had processed earlier from the 5 raw files.
Accel_Merged$Behavior=predict(rf_deploy,Accel_Merged)
Though our model had an accuracy of ~95%, it is important to make sure model predictions make biological sense. For instance, if our predictions on the accelerometer data indicated that an animal was resting 20 hours a day and grazing only between 1 and 5 a.m., we might suspect that our model was doing a poor job predicting unobserved behavior. This a priori knowledge can come from experience or from visualizing the observed data in plots such as histograms. The code below constructs a dataframe of daily behavior estimates in minutes for resting, walking, and grazing behavior.
#convert date time to only date
Accel_Merged$Date= as.Date(Accel_Merged$Time, format = "%m/%d/%Y")
#subset out walk predictions, count the number per day and convert back to minutes
df_pred_walk=subset(Accel_Merged,Behavior=='W')
df_pred_walk=aggregate(df_pred_walk$Behavior,by=list(c(df_pred_walk$Date)),FUN=length)
colnames(df_pred_walk)=c("Date","walk_Min")
df_pred_walk$walk_Min=(df_pred_walk$walk_Min*5)/60
#subset out graze predictions, count the number per day and convert back to minutes
df_pred_graze=subset(Accel_Merged,Behavior=='G')
df_pred_graze=aggregate(df_pred_graze$Behavior,by=list(c(df_pred_graze$Date)),FUN=length)
colnames(df_pred_graze)=c("Date","graze_Min")
df_pred_graze$graze_Min=(df_pred_graze$graze_Min*5)/60
#subset out rest predictions, count the number per day and convert back to minutes
df_pred_rest=subset(Accel_Merged,Behavior=='R')
df_pred_rest=aggregate(df_pred_rest$Behavior,by=list(c(df_pred_rest$Date)),FUN=length)
colnames(df_pred_rest)=c("Date","rest_Min")
df_pred_rest$rest_Min=(df_pred_rest$rest_Min*5)/60
#Combine into one data frame column and calculate total
df_Total=as.data.frame(cbind(as.character(df_pred_graze$Date),df_pred_graze$graze_Min,df_pred_rest$rest_Min,df_pred_walk$walk_Min))
colnames(df_Total)=c("Date","Graze_Min","Rest_Min","Walk_Min")
df_Total$Graze_Min=as.numeric(df_Total$Graze_Min)
df_Total$Rest_Min=as.numeric(df_Total$Rest_Min)
df_Total$Walk_Min=as.numeric(df_Total$Walk_Min)
df_Total$Total_Minutes=df_Total$Graze_Min+df_Total$Rest_Min+df_Total$Walk_Min
#print table as output
knitr::kable(df_Total,digits=0,caption="Daily Behavior Estimates")
| Date | Graze_Min | Rest_Min | Walk_Min | Total_Minutes |
|---|---|---|---|---|
| 2017-06-10 | 67 | 136 | 15 | 218 |
| 2017-06-11 | 598 | 740 | 102 | 1440 |
| 2017-06-12 | 526 | 827 | 87 | 1440 |
| 2017-06-13 | 595 | 735 | 110 | 1440 |
| 2017-06-14 | 587 | 716 | 136 | 1440 |
| 2017-06-15 | 276 | 565 | 38 | 879 |
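A quick programmatic sanity check, sketched here on hypothetical daily totals mimicking df_Total, is to confirm that complete days sum to 1440 minutes and to flag days where a behavior exceeds a plausible ceiling (the 20-hour rest threshold below is an illustrative assumption, not a value from the tutorial):

```r
#hypothetical daily totals in minutes (illustration only)
df_check <- data.frame(Graze_Min = c(598, 526),
                       Rest_Min  = c(740, 827),
                       Walk_Min  = c(102, 87))
df_check$Total <- rowSums(df_check)
#complete days of predictions should total 24 h * 60 min
all(df_check$Total == 1440)
#flag days resting more than an assumed ceiling of 20 hours
which(df_check$Rest_Min > 20*60)
```

Partial days, such as the first and last days of a deployment, will fail the 1440-minute check, which is one reason to drop them before downstream analysis.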
Estimating daily behavior can be used in a number of applications. For example we could look to see if changes in normal daily behavior can indicate sickness in animals. Or we can use daily behavior to estimate additional energy requirements of animals based on existing equations. The example below calculates Net Energy for Activity based on the accelerometer data from our example.
df_NEm<-df_Total[2:5,] #keep whole days only
#convert rest minutes to hours (standing time for the activity equation)
df_NEm$Rest_Hours=df_NEm$Rest_Min/60
#convert walk minutes to distance in kilometers per day
Walking_Rate<-1.05/1000 #km per second
Walk_Distance_Per_Min<-Walking_Rate*60 #km per minute
df_NEm$Avg_Distance_Walk<-df_NEm$Walk_Min*Walk_Distance_Per_Min
#convert graze minutes to distance in kilometers per day
Grazing_Walking_Rate<-0.093/1000 #km per second
Graze_Distance_Per_Min<-Grazing_Walking_Rate*60 #km per minute
df_NEm$Avg_Distance_Grazed<-df_NEm$Graze_Min*Graze_Distance_Per_Min
Fraction_Distance_Flat<-0.5 #assume half the distance is traveled on flat ground
df_NEm$Distance_Slope_Km<-(df_NEm$Avg_Distance_Walk+df_NEm$Avg_Distance_Grazed)*(1-Fraction_Distance_Flat)
df_NEm$Distance_flat_Km<-(df_NEm$Avg_Distance_Walk+df_NEm$Avg_Distance_Grazed)*Fraction_Distance_Flat
#Hard-coded parameters from Tedeschi and Fox 2020 (page 295, Equation 12.14)
Position_Change<-6 #number of position changes per day (e.g., resting to walking)
Average_slope<-3.6 #degrees
Average_slope_fraction<-Average_slope/100
inclination<-atan(Average_slope_fraction)*(180/pi) #inclination in degrees
df_NEm$km_ascending<-(df_NEm$Distance_Slope_Km-df_NEm$Distance_Slope_Km*cos(inclination*pi/180))/sin(inclination*pi/180)
FBW<-340 #kg full body weight (750 lb steer)
#calculate net energy required for activity (Mcal/day) for each day
df_NEm$NEmr_act<-(0.1*df_NEm$Rest_Hours+0.062*Position_Change+0.621*df_NEm$Distance_flat_Km+6.69*df_NEm$km_ascending)*FBW/1000
#plot daily
ggplot(df_NEm,aes(x=Date,y=NEmr_act))+
geom_bar(stat="identity", fill="steelblue")+
theme_minimal()+
xlab("\nDate")+
ylab('Mcal/day\n ')+
ggtitle("Daily Net Energy Required for Physical Activity")
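As a worked check of the activity equation, the sketch below plugs one hypothetical day's values into Equation 12.14 (the input numbers are illustrative assumptions, not taken from the dataset):

```r
#hypothetical inputs for one day (illustration only)
rest_hours  <- 12     #hours spent resting/standing
pos_changes <- 6      #position changes per day
flat_km     <- 0.35   #km traveled on flat ground
ascend_km   <- 0.02   #km ascending
bw          <- 340    #kg body weight
#net energy required for activity, Mcal/day
NEm_act <- (0.1*rest_hours + 0.062*pos_changes +
            0.621*flat_km + 6.69*ascend_km)*bw/1000
round(NEm_act, 3) #about 0.654 Mcal/day for these inputs
```

Hand-checking the equation at a single point like this is a useful guard against unit mistakes, such as passing minutes where the equation expects hours.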